Two-speed genome evolution drives pathogenicity in fungal pathogens of animals

Significance Batrachochytrium salamandrivorans (Bsal) and its closest relative B. dendrobatidis (Bd) are fungal pathogens that threaten amphibians globally. Pathogenicity in vertebrates by species of Batrachochytrium is thought to have emerged from nonpathogenic and saprobic relatives over millions of years through gene expansions of secreted proteolytic enzymes families. Using deep nanopore sequencing and comparative genomics, we discover that Batrachochytrium genomes have undergone a repeat-driven expansion characterized by flanking repetitive elements enriched around pathogenicity genes, genes with signatures of positive selection, and genes upregulated during infection. These genomic features are the hallmarks of two-speed genomes that have to date only been described in plant pathogens. These discoveries shed new light on the evolution of fungal pathogens of vertebrates driving global declines and extinctions.


Variant calling and quadrant analysis
We identified 167,709 heterozygous positions (2.3 per kb) across the Bsal genome, which is about half the previously predicted amount of 4.96 per kb (2), perhaps owing to the better resolution and separation of repetitive regions and gene families.
These positions fell across every chromosome in the genome, supporting Bsal being diploid ( Figure S4).
Using the custom discrete-time pattern Markov chain approach (https://github.com/rhysf/2speed_genomes), we identified significant stretches of genes belonging to QLL or QSS for each of the chytrid genomes. The two most significant of these in Bsal were scaffold 334 with 15 consecutive genes in QLL (total genes on contig = 380, p = 2.19E -6 ) and scaffold 320 with 16 consecutive genes in QLL (total genes on contig = 122, p = 2.77E -6 ). The 15 consecutive QLL genes on scaffold 334 included only 1 gene encoding a secreted protein (Tribe 536, with a kelch4 galactose oxidase central domain). The 16 consecutive QLL genes on scaffold 320 included 6 genes encoding secreted proteins (three belonged to Tribe 31 (unknown function), and two belonged to Tribe 17 with a S1-P1 nuclease domain).

Autonomous and fully functional Class I and II transposable elements in Bsal
For a transposable element (TE) to remain actively transposing, it must encode several genes (3). To explore the activity of TEs, we searched for a range of PFAM and Conserved Domains Database (CDD) domains known to be present in active TEs (Dataset S2), as well as the reverse transcriptase (RT) gene. Of the 21 LINE and 42 LTR families found in Bsal, we found 28.6% of LINEs (n = 6, including BatrLINE-1) and 35.7% of LTRs (n = 15) have the potential to be fully functional (e-value < 1E -3 for both Pfam hmmsearch and rspblast CDD search). LINEs were considered to be capable of mobilization if they feature at least one RT and one apurinic endonuclease domain (3).
LTRs were considered to be capable of mobilization if they feature domains for GAG, a structural protein for virus-like particles, and for POL, which encodes an aspartic proteinase (AP), RT, and DDE integrase (INT) (3). We found that 90.5% of Bsal's LINE repeat families and 97.6% of Bsal's LTR families have a recognizable RT domain (Dataset S2). Notably, all LTR repeat families considered to be fully functional and autonomous (based on their consensus) belong to Gypsy elements. Of the 20 Class II 7

Sequencing and library preparation
Bsal zoosporangia and zoospores were cultured in tryptone-gelatin hydrolysatelactose (TGhL) broth in cell culture flasks at 18°C. 200ml of 6 days old cultures were harvested and centrifugated at 1700g for 10 mins at 4°C. Cell pellet was washed with ice cold water and snap frozen in liquid nitrogen. High-molecular weight DNA for Nanopore sequencing was obtained by a customized cetyltrimethylammonium bromide (CTAB) extraction procedure (10,11) with the modification of using RNase A (T3018, NEB) instead of RNase T1. Briefly, cell pellet was ground with a mortar and pestle in liquid nitrogen with 2g of sand, followed by lysis with CTAB, two-step purification with phenol/chloroform/isoamyl alcohol and precipitation with isopropanol. Care was taken to avoid DNA shearing (cut off tips, no heating of samples). DNA concentration was checked using the Qubit BR assay (Invitrogen) and DNA size range profile was checked by Tapestation gDNA screentape (Agilent).
Two independent sequencing libraries were constructed, one with long unfragmented DNA, one with DNA fragmented to 12kb with a gTube (520079, Covaris).
Adapters were ligated using the Genomic DNA by Ligation kit (SQK-LSK109, Oxford Nanopore Technologies) and NEBNext Quick T4 DNA Ligase (E6056, NEB) followed by The longest read was 318,012 nt long.
Ploidy was assessed using Smudgeplot (22) (using kmc to generate kmer databases (23)) and allele frequency plots. Using the kmer histograms generated using kmc for the respective kmer size and the proposed ploidy of 2, genome length was estimated using GenomeScope (22). In case of kmer size 21, GenomeScope estimated the genome length to be 167,775,294 bp, with the length estimate not reaching significance (p = 0.7717) and the model fit reaching 31.7-58.5%. For kmer size 30, the model did not converge. Likely the highly repetitive nature of Bsal's genome accounts for estimate that is more than twice the size of our assembly. We called variants using the Illumina Bsal reads from (2) aligned to the new genome using BWA v.0.7.4 and Pilon v1.9 (16) (parameters -diploid, --vcf) with the diploid setting, and filtered all sites labelled as 'LowCov', 'Amb' or 'Del'.
Gene predictions were checked for a variety of issues, including overlapping of noncoding genes, overlapping of coding genes, and the presence of in-frame stops.
Genes were named according to evidence from BLASTx and HMMER in the following order of precedence: (i) Swiss-Pro (30), (ii) TIGRfam (36), and (iii) KEGG (31) (where BLASTx hits must meet the 70% identity and 70% overlap criteria to be considered a good hit and for the name to be applied). Otherwise, genes were classified as hypothetical proteins. Genes were functionally annotated by assigning PFAM (release 27) domains (37), and BLASTx for KEGG assignment (each where E-value <1x10 -10 ), as well as ortholog mapping to genes of known function. SignalP 4.0 (38) and TMHMM (39) were used to identify secreted proteins and transmembrane proteins, respectively.
The protease composition of each chytrid was determined using top high scoring pairs from BLASTp searches (e-value < 1e -5 ) made to the file 'pepunit.lib', which is a non-redundant library of protein sequences of all the peptidases and peptidase inhibitors that are included in the MEROPS database (Release 12.1), and compared to the 447 thousand protein sequences in the 2014 version we used in our previous genomic analyses (2). All proteases with matches to M36 metalloproteases were aligned using MUSCLE v3.8.31 (40) and trimmed of excess gaps using trimAl 1.2rev59 (41) gappyout. We constructed the gene trees with RAxML v7.7.8(42) using the JTT amino acid transition model, which was visualized using iTOL v6 (43).
Secreted proteins were predicted in each of the 22 chytrid species using SignalP 4.0 with the 'eukaryote' organism type and otherwise default settings (38). These gene sequences had their secretion signal cleaved according to the predicted cleavage site, which were then BLASTp (all vs all) with default parameters, and clustered according to sequence similarity using MCL (http://micans.org/mcl/man/clmprotocols.html) with recommended inflation setting '-I 1.4'. The MCL program mcxdump was then used with default settings to output clusters. Secreted families were classified using PFAM domains (release 34.0) (33). Small secreted proteins (SSPs) were classified as those secreted proteins with <300 amino acids and >4 cysteines.

Chytridiomycota genomic and phylogenetic analysis
The genomes and gene annotation for B. dendrobatidis (Bd) JEL423, S. Single copy orthologs were identified between chytrids using the Synima (50) pipeline with Orthofinder, and aligned using MUSCLE v3.8.31 (40). A maximum likelihood tree was constructed using IQ-Tree v1.6.12 (51) (56). The output of Repeatmodeller (consensi.fa.classified) was then used as a library for RepeatMasker. The repeat content and family distribution for each chytrid species was determined from RepeatMasker.out, excluding lower scoring matches whose domain partly (<80%) includes the domain of another match.
TE and repeat distributions in the genome were assessed using gff files converted from the Repeatmasker .out files with a custom script that stringently excludes lower scoring matches whose domain partly (<80%) includes the domain of another match and visualized using IGV 2.8.2 (57). Repeatmaskers -GFF option does include those lower scoring, overlapping repeats and is therefore not suitable for further in depth repeat distribution analysis. TE distribution in relation to GC content was analyzed using Pilon's GC.wig files. Additionally, TE and repeat content of 10kb windows assigned to different quadrants was calculated using custom scripts based on RepeatMasker .out files and lists of genes assigned to quadrants (https://bitbucket.org/Theresa_42/wackeretal_2022_bsal_2speedgenome/).
To assess if repeat content was correlated to genome assembly quality, Spearman's Rank Correlation Coefficients, Spearman's correlation and linear regression (linear model fitting based on formula by Wilkinson and Rogers (1973) (58)) were calculated between repeat content and N50, the number of contigs, genome length, as well as the number of genes using ggpubr (https://github.com/kassambara/ggpubr). This was done for both all genome assemblies, as well as genome assemblies partitioned by sequencing technology.
For the heatmap showing repeat superfamily profiles, heatmap.2 was used with hierarchical clustering and Euclidian as a distance measure. Repeat families that did not exceed 1% abundance in any of the chytrids were excluded. Repeat families in Bd and Bsal were aligned using blastn with an e-value of 0.01, no filtering (-dust no, -soft_masking false) and a wordsize of 7 to determine homologs. Telomeric sequences were manually curated.

Analysis of RIP activity and RNAi machinery
To assess if Bsal's genome has evidence of repeat induced point mutations (RIP), all individual non-consensus entries for each repeat family were extracted from the genome and aligned using MAFFT v7.490 (--localpair and -reorder options) (68).
Alignments were then tested for signatures of RIP activity using RIPcal (6) with the degenerate consensus sequence as reference, as well as dinucleotide frequencies of core conserved chytrid genes, aligned with MAFFT (--localpair and -reorder options). Pfam domains: PF00145) as described before (72). Three candidate C-5 cytosine methyltransferases were identified: BSLG_000286, BSLG_000626 and BSLG_010741.
They were aligned to 15 known RID proteins (Dataset S3) using MAFFT as described (72) before and screened for an amino acid change of NV to QT or ET in motif VI and visualized in Jalview 2.11.2.5 (73). All three putative RID C-5 cytosine methyltransferases did not have an amino acid change in motif VI ( Figure S11).
Additionally, homologs of 15 known RID proteins were searched for using blastp (evalue < 1e -5 ) (63). No homologs were identified.  (79). Briefly, raw sequences were preprocessed by mapping reads to the reference genome Bd JEL423 using BWA-MEM v.0.7.17 (80). Next, duplicates were marked, and the resulting file was sorted by coordinate order. Intervals were created using a custom bash script allowing parallel analysis of large batches of genomics data. Using the scatter-gather approach,  Consecutive gene counts were generated using lists of genes assigned to their quadrants as defined above and a bespoke script. To assess the significance of finding a sequence (n) of any given length (k) of consecutive genes of the same quadrant, a discrete pattern Markov chains were used. The probability of transitioning from one quadrant to the next was set to 0.25. Based on that, a (k+1)(k+1) transition matrix was generated. Once the transition matrix was constructed, for a given value of the probability of having the number of consecutive genes of a certain quadrant in the chain was ( | ) = { }0,k. In the calculation, was set to 100 repetitions of equiprobable outcome. is the event of the occurrence of k consecutive genes of the same quadrant.